Segmenting A Sentence Into Morphemes Using Statistic Information Between Words

Authors

Shiho Nobesawa

Junya Tsutsumi

Tomoaki Nitta

Kotaro Ono

Sun Da Jiang

Masakazu Nakanishi

Abstract

This paper is on dividing non-separated language sentences (whose words are not separated from each other with a space or other separators) into morphemes using statistical information, not the grammatical information which is often used in NLP. In this paper we describe our method and experimental results on Japanese and Chinese sentences. As will be seen in the body of this paper, the results show that this system is efficient for most of the sentences.

1 INTRODUCTION AND MOTIVATION

An English sentence has several words and those words are separated with a space, so it is easy to divide an English sentence into words. However, a Japanese sentence needs parsing if you want to pick up the words in the sentence. This paper is on dividing non-separated language sentences into words (morphemes) without using any grammatical information. Instead, this system uses the statistical information between morphemes to select the best ways of segmenting sentences in non-separated languages.

Thinking about segmenting a sentence into pieces, it is not very hard to divide a sentence using a certain dictionary. The problem is how to decide which segmentation is the best answer. For example, there must be several ways of segmenting a Japanese sentence written in Hiragana (Japanese alphabet). Maybe a lot more than 'several'. So, to make the segmenting system useful, we have to consider how to pick up the right segmented sentences from all the possible seems-like-segmented sentences.

This system uses statistical information between morphemes to see how 'sentence-like' (how 'likely' to happen as a sentence) the segmented string is. To get the statistical association between words, mutual information (MI) comes to be one of the most interesting methods. In this paper MI is used to calculate the relationship between words found in the given sentence. A corpus of sentences is used to gain the MI.
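The idea of scoring candidate segmentations by mutual information can be sketched as follows. This is a minimal illustration under stated assumptions, not the MSS implementation: it uses pointwise MI over adjacent pairs only, sums the pair scores to rate a candidate, and gives unseen pairs a fixed penalty. The names train_mi, score, best_segmentation and UNSEEN_PENALTY, as well as the toy English corpus, are hypothetical.

```python
import math
from collections import Counter

# Assumption: pairs never observed in the corpus receive this fixed score.
UNSEEN_PENALTY = -10.0


def train_mi(corpus):
    """Build an MI function from a corpus of already-segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def mi(x, y):
        # Pointwise MI: log2( P(x, y) / (P(x) * P(y)) )
        if bigrams[(x, y)] == 0:
            return UNSEEN_PENALTY
        p_xy = bigrams[(x, y)] / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        return math.log2(p_xy / (p_x * p_y))

    return mi


def score(segmentation, mi):
    """Rate a candidate by summing the MI of each adjacent pair of segments."""
    return sum(mi(a, b) for a, b in zip(segmentation, segmentation[1:]))


def best_segmentation(candidates, mi):
    """Return the candidate segmentation with the highest score."""
    return max(candidates, key=lambda seg: score(seg, mi))


# Toy corpus (hypothetical), standing in for a corpus of segmented sentences.
CORPUS = [
    ["is", "he", "old"],
    ["he", "is", "old"],
    ["i", "am", "old"],
    ["she", "is", "here"],
]

# Two dictionary-consistent ways of cutting the string 'isheold'.
CANDIDATES = [["i", "she", "old"], ["is", "he", "old"]]

mi = train_mi(CORPUS)
best = best_segmentation(CANDIDATES, mi)
print(best)
```

With this toy corpus the pairs ('is', 'he') and ('he', 'old') have been observed while ('i', 'she') and ('she', 'old') have not, so the second candidate is preferred. Summing pair scores is only one possible way to combine them; candidates with different numbers of segments may need normalization.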
To implement this method, we implemented a system, MSS (Morphological Segmentation using Statistical information). What MSS does is to find the best way of segmenting a non-separated language sentence into morphemes without depending on grammatical information. We can apply this system to many languages.

2 MORPHOLOGICAL ANALYSIS

2.1 What a Morphological Analysis Is

A morpheme is the smallest unit of a string of characters which has a certain linguistic meaning itself. It includes both content words and function words. In this paper the definition of a morpheme is a string of characters which is looked up in the dictionary. Morphological analysis is to: 1) recognize the smallest units making up the given sentence (if the sentence is of a non-separated language, divide the sentence into morphemes, i.e. automatic segmentation), and 2) check whether the morphemes are the right units to make up the sentence.

2.2 Segmenting Methods

We have some ways to segment a non-separated sentence into meaningful morphemes. The three methods explained below are the most popular ones to segment Japanese sentences.

• The longest segment method: Read the given sentence from left to right and cut it with the longest possible segment. For example, if we get 'isheold', first we look for segments which use the first few letters in it, 'i' and 'is'. It is obvious that 'is' is longer than 'i', so the system takes 'is' as the segment. Then it tries the same method to find the segments in 'heold' and finds 'he' and 'old'.

• The least-bunsetsu segmenting method: Get all the possible segmentations of the input sentence and choose the segmentation(s) which has the least bunsetsu in it. This method is to segment Japanese sentences, which have content words and function words together in one bunsetsu most of the time.
This method helps not to cut a sentence into too small meaningless pieces.

• The letter-type segmenting method: In the Japanese language we have three kinds of letters called Hiragana, Katakana and Kanji. This
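The first two methods in the list above can be sketched in a few lines. This is a hedged illustration, not the authors' code: the dictionary is a hypothetical toy set of English strings, and the second function counts dictionary segments as a stand-in for counting bunsetsu, which is a simplification. The function names are assumptions introduced for this example.

```python
# Toy dictionary (hypothetical), enough to segment the string 'isheold'.
DICTIONARY = {"i", "is", "he", "she", "old"}


def longest_match(text, dictionary):
    """Longest segment method: scan left to right, always taking the
    longest dictionary entry that starts at the current position."""
    max_len = max(map(len, dictionary), default=0)
    result, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                result.append(text[i:j])
                i = j
                break
        else:
            return None  # no dictionary entry starts here; the greedy scan is stuck
    return result


def all_segmentations(text, dictionary):
    """Enumerate every way of cutting text into dictionary entries."""
    if not text:
        return [[]]
    results = []
    for j in range(1, len(text) + 1):
        head = text[:j]
        if head in dictionary:
            for rest in all_segmentations(text[j:], dictionary):
                results.append([head] + rest)
    return results


def fewest_segments(text, dictionary):
    """Simplified least-bunsetsu method: keep the segmentation(s) with the
    fewest segments (segments are counted here instead of bunsetsu)."""
    candidates = all_segmentations(text, dictionary)
    if not candidates:
        return []
    fewest = min(len(c) for c in candidates)
    return [c for c in candidates if len(c) == fewest]


print(longest_match("isheold", DICTIONARY))
print(fewest_segments("isheold", DICTIONARY))
```

For 'isheold' the greedy scan yields 'is', 'he', 'old', matching the example in the text. The fewest-segments filter keeps both 'is he old' and 'i she old', since each has three segments; ties of this kind are where a further criterion for choosing the best segmentation is needed.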

Publication date: 1994
